This dataset contains a survey of students in a Math course and a Portuguese language course at two secondary schools in Portugal (Gabriel Pereira and Mousinho da Silveira). The dataframes contains a lot of important and interesting social, gender and study information regarding the students at the schools.
The contributor of the data suggested that the data could be great for running predictive models such as predicting students final grades. Besides the recommendation from the data’s poster, there were a couple other reasons why I choose the dataset. This dataset seemed intresting and easy to work with (including pretty decent metadata), looked like it had enough data with dependent and independent variables to do multiple regression, and contained two csv’s allowing me to perform a join. These factors made me think it would be perfect for this midterm assingment.
I hope to learn about what socioeconimic and social backgrounds can impact students ability to perform in acadamia and/or find factors that lead to alcohol consumption. I also would be intrested in looking at the differences between each school, maybe again looking at grades and alcohol consumption.
Before we get to the fun stuff I have to actaully put my data in R…
setwd("C:/Users/carso/OneDrive/Desktop/MARE 375 Midterm")
library(readr)
student_mat <- read_csv("student-mat.csv")
library(readr)
student_por <- read_csv("student-por.csv")
Changing names for the two dataframes, because I don’t want to type more than needed.
sm <- student_mat
sp <- student_por
sm
## # A tibble: 395 x 33
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T 2 2 other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 2 serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 385 more rows, and 23 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>
sp
## # A tibble: 649 x 33
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T 2 2 other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 2 serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 639 more rows, and 23 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>
By using the str() command, I am able to see how many observations and variables are in my dataframes and the type of data :
str(sm)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : num 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : num 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : num 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : num 6 6 10 15 10 15 11 6 19 15 ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. failures = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. paid = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double()
## .. )
str(sp)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 649 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : num 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "no" "no" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : num 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : num 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : num 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : num 11 11 12 14 13 13 13 13 17 13 ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. failures = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. paid = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double()
## .. )
Attributes for both dataframes:
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - 3 hours)
studytime - weekly study time (numeric: 1 - 10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
Let’s see if the data is Tidy!
head(sm)
## # A tibble: 6 x 33
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~ course
## 2 GP F 17 U GT3 T 1 1 at_h~ other course
## 3 GP F 15 U LE3 T 1 1 at_h~ other other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~ home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 serv~ other reput~
## # ... with 22 more variables: guardian <chr>, traveltime <dbl>,
## # studytime <dbl>, failures <dbl>, schoolsup <chr>, famsup <chr>,
## # paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## # internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## # goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>,
## # G1 <dbl>, G2 <dbl>, G3 <dbl>
head(sp)
## # A tibble: 6 x 33
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~ course
## 2 GP F 17 U GT3 T 1 1 at_h~ other course
## 3 GP F 15 U LE3 T 1 1 at_h~ other other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~ home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 serv~ other reput~
## # ... with 22 more variables: guardian <chr>, traveltime <dbl>,
## # studytime <dbl>, failures <dbl>, schoolsup <chr>, famsup <chr>,
## # paid <chr>, activities <chr>, nursery <chr>, higher <chr>,
## # internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>,
## # goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>, absences <dbl>,
## # G1 <dbl>, G2 <dbl>, G3 <dbl>
In my data, each variable has its own column and each observation hsd its own row.
Sadly, I don’t know if my data has unique keys.
I can’t move forward until my data is 100% tidy, so I will check for unique keys
I will want to see if my data has any primary keys
I am going to use many identifiers that I think will help me reach a primary key:
Math Course
# Making sure my packages are loaded
detach(package:readr, unload = TRUE)
library(tidyverse)
## Counting repeated observations to see if I have a primary key
sm %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>%
filter(n >1)
## # A tibble: 4 x 14
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP M 16 U GT3 T 3 3 serv~ other home
## 2 GP M 16 U GT3 T 4 4 teac~ teac~ course
## 3 GP M 17 U GT3 T 2 1 other other home
## 4 GP M 18 U GT3 T 4 4 teac~ serv~ home
## # ... with 3 more variables: nursery <chr>, internet <chr>, n <int>
Portuguese Language Course
sp %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>%
filter(n >1)
## # A tibble: 11 x 14
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 17 U GT3 T 4 4 teac~ serv~
## 2 GP F 18 U GT3 T 1 1 other other
## 3 GP M 16 U GT3 T 3 3 serv~ other
## 4 GP M 16 U GT3 T 4 4 serv~ serv~
## 5 GP M 16 U GT3 T 4 4 teac~ teac~
## 6 GP M 17 U GT3 T 2 1 other other
## 7 GP M 18 U GT3 T 4 4 teac~ serv~
## 8 MS F 16 R GT3 T 2 2 other other
## 9 MS F 16 U GT3 T 1 2 other serv~
## 10 MS F 18 U GT3 T 1 1 other other
## 11 MS M 15 R LE3 T 4 1 heal~ serv~
## # ... with 4 more variables: reason <chr>, nursery <chr>, internet <chr>,
## # n <int>
It looks like with this set of identifiers, we do not have a primary key
Math Course
sm %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet, G1,G2,G3) %>%
filter(n >1)
## # A tibble: 0 x 17
## # ... with 17 variables: school <chr>, sex <chr>, age <dbl>,
## # address <chr>, famsize <chr>, Pstatus <chr>, Medu <dbl>, Fedu <dbl>,
## # Mjob <chr>, Fjob <chr>, reason <chr>, nursery <chr>, internet <chr>,
## # G1 <dbl>, G2 <dbl>, G3 <dbl>, n <int>
Portuguese Language Course
sp %>%
count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet, G1,G2,G3) %>%
filter(n >1)
## # A tibble: 1 x 17
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 MS F 16 R GT3 T 2 2 other other course
## # ... with 6 more variables: nursery <chr>, internet <chr>, G1 <dbl>,
## # G2 <dbl>, G3 <dbl>, n <int>
It looks like I still don’t have a primary key for both schools :(
library(dplyr)
#### Adding ID's to my data
smi <- mutate(sm,id = dplyr::row_number(G1))
spi <- mutate(sp,id = dplyr::row_number(G1))
Made ID’s by grade because it was rather variable
smi
## # A tibble: 395 x 34
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T 2 2 other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 2 serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 385 more rows, and 24 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>
Primary key check:
smi %>% count(id) %>% filter(n >1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
spi
## # A tibble: 649 x 34
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 2 heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T 2 2 other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 2 serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 639 more rows, and 24 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>
Primary key check:
spi %>% count(id) %>% filter(n >1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
Looks like I have made a primary key!
Okay time for imputations!
Loading the packages I need!
detach(package:tidyverse, unload = TRUE)
library(mice)
Looking for missing data in my dataframes
Math Course
md.pattern(smi)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 395 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0
## guardian traveltime studytime failures schoolsup famsup paid
## 395 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## activities nursery higher internet romantic famrel freetime goout Dalc
## 395 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## Walc health absences G1 G2 G3 id
## 395 1 1 1 1 1 1 1 0
## 0 0 0 0 0 0 0 0
Portuguese Language Course
md.pattern(spi)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 649 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0
## guardian traveltime studytime failures schoolsup famsup paid
## 649 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## activities nursery higher internet romantic famrel freetime goout Dalc
## 649 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## Walc health absences G1 G2 G3 id
## 649 1 1 1 1 1 1 1 0
## 0 0 0 0 0 0 0 0
My data doesn’t have any missing values, so I will have to delete some stuff before I make imputations!
Deleting stuff and making NA!
library(naniar)
smi2 <- smi %>% replace_with_na_all(condition = ~.x == 2)
spi2 <- spi %>% replace_with_na_all(condition = ~.x == 2)
Decided to replace all twos with NA cause that seemed fun
smi2
## # A tibble: 395 x 34
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 NA heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T NA NA other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 NA serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 385 more rows, and 24 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>
spi2
## # A tibble: 649 x 34
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_h~ teac~
## 2 GP F 17 U GT3 T 1 1 at_h~ other
## 3 GP F 15 U LE3 T 1 1 at_h~ other
## 4 GP F 15 U GT3 T 4 NA heal~ serv~
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 serv~ other
## 7 GP M 16 U LE3 T NA NA other other
## 8 GP F 17 U GT3 A 4 4 other teac~
## 9 GP M 15 U LE3 A 3 NA serv~ other
## 10 GP M 15 U GT3 T 3 4 other other
## # ... with 639 more rows, and 24 more variables: reason <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>,
## # famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## # health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>, id <int>
Performing the imputation!
Using PMM (Predictive Mean Matching) for method, because I am doing the imputation on numeric variables.
“smii <- mice(smi2, m = 5, maxit = 50, method = ‘pmm’, seed = 500)”
“spii <- mice(spi2, m = 5, maxit = 50, method = ‘pmm’, seed = 500)”
This isn’t in a coding box because the output is way too long, but I ran it below without showing the output.
Math Course:
smii_imp <- complete(smii,3)
Portuguese Language Course :
spii_imp <- complete(spii,3)
head(smii_imp)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 4 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 1 1 0 yes no no
## 2 course father 1 1 0 no yes no
## 3 other mother 1 1 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 1 0 no yes yes
## 6 reputation mother 1 1 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 4 1
## 4 yes yes yes yes yes 3 4 4 1
## 5 no yes yes no no 4 3 5 1
## 6 yes yes yes yes no 5 4 3 1
## Walc health absences G1 G2 G3 id
## 1 1 3 6 5 6 6 3
## 2 1 3 4 5 5 6 4
## 3 3 3 10 7 8 10 34
## 4 1 5 12 15 14 15 331
## 5 4 5 4 6 10 10 10
## 6 1 5 10 15 15 15 332
head(spii_imp)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 4 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 1 1 0 yes no no
## 2 course father 1 1 0 no yes no
## 3 other mother 1 1 0 yes no no
## 4 home mother 1 3 0 no yes no
## 5 home father 1 3 0 no yes no
## 6 reputation mother 1 3 0 no yes no
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 3 1
## 4 yes yes yes yes yes 3 4 3 1
## 5 no yes yes no no 4 3 3 1
## 6 yes yes yes yes no 5 4 3 1
## Walc health absences G1 G2 G3 id
## 1 1 3 4 0 11 11 1
## 2 1 3 0 9 11 11 93
## 3 3 3 6 12 13 12 344
## 4 1 5 0 14 14 14 498
## 5 1 5 0 11 13 13 253
## 6 1 5 6 12 12 13 345
It seems like the imputation did a decent job. It replaced all of the NA values with one, which is close to two, but not quite!
Because my two dataframes both deal with students in either a Math course or a Portuguese language course at two secondary schools in Portugal (Gabriel Pereira and Mousinho da Silveira), I can combine the data by merging the students who are in both classes together.
The metadata of the datset I am using said that there is an overlap of 382 students that are in both the math class and the potuguese language class. They gave a suggestion on what to merge to capture the 382 students. Below is the code I used to merge both classes into one master data frame.
It was suggested to merge the two data sets by the following variables:
“school”,“sex”,“age”,“address”,“famsize”,“Pstatus”,“Medu”,“Fedu”,“Mjob”,“Fjob”,“reason”,“nursery”,“internet”
Here is the merge they suggested
All = merge(smi,spi,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"))
print(nrow(All)) # 382 students
## [1] 382
I am intrested in seeing if this combined data set actually contains 382 unique students who attend both schools, so I will do some more counts
All %>% count(school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,nursery,internet) %>% filter(n >1)
## # A tibble: 8 x 14
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 17 U GT3 T 4 4 teac~ serv~ course
## 2 GP F 18 U GT3 T 1 1 other other home
## 3 GP M 16 U GT3 T 3 3 serv~ other home
## 4 GP M 16 U GT3 T 4 4 serv~ serv~ course
## 5 GP M 16 U GT3 T 4 4 teac~ teac~ course
## 6 GP M 17 U GT3 T 2 1 other other home
## 7 GP M 18 U GT3 T 4 4 teac~ serv~ home
## 8 MS F 18 U GT3 T 1 1 other other course
## # ... with 3 more variables: nursery <chr>, internet <chr>, n <int>
Seems like this isn’t truly unique counts of students. I guess I will take a crack at it!
I think I will merge the data with all variables to make sure that we can see what is going on I did not include G1,G2,G3 or paid variables. This is because the paid variable shouldn’t effect our merge and the grades need to be seperate for each class.
Merge by default uses an inner join to join the data, which is what I need for this join
All2 <- merge(smi,spi,by=c("school","sex","age","address","famsize","Pstatus",
"Medu","Fedu","Mjob","Fjob","reason","nursery","internet",
"guardian","traveltime","studytime","failures",
"schoolsup","famsup","activities","higher","romantic",
"famrel","freetime","goout","Dalc","Walc","health","absences"))
print(nrow(All2)) # 85 students
## [1] 85
Time to check if we have repeats!
All2 %>% count(school,sex,age,address,famsize,Pstatus,
Medu,Fedu,Mjob,Fjob,reason,nursery,internet, guardian,traveltime,studytime,failures,
schoolsup,famsup,activities,higher,romantic,
famrel,freetime,goout,Dalc,Walc,health,absences) %>% filter(n >1)
## # A tibble: 0 x 30
## # ... with 30 variables: school <chr>, sex <chr>, age <dbl>,
## # address <chr>, famsize <chr>, Pstatus <chr>, Medu <dbl>, Fedu <dbl>,
## # Mjob <chr>, Fjob <chr>, reason <chr>, nursery <chr>, internet <chr>,
## # guardian <chr>, traveltime <dbl>, studytime <dbl>, failures <dbl>,
## # schoolsup <chr>, famsup <chr>, activities <chr>, higher <chr>,
## # romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## # Walc <dbl>, health <dbl>, absences <dbl>, n <int>
Looks like we do not have any repeated rows!
Now that I have joined my data, I want to make some dataframes that I can use for plots!
Master <- All2
I should rename some of my rows so my master dataframe is easier to interpret!
library(plyr)
Master <- rename(Master, c("paid.x" = "paid.m", "G1.x" = "G1.m", "G2.x" = "G2.m", "G3.x" = "G3.m", "id.x" = "id.m", "paid.y" = "paid.p", "G1.y" = "G1.p", "G2.y" = "G2.p", "G3.y" = "G3.p", "id.y" = "id.p"))
Renamed the x’s to m for Math course and the y’s to p for the Portuguese language class.
head(Master)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 15 R GT3 T 2 2 at_home other
## 2 GP F 15 R GT3 T 2 4 services health
## 3 GP F 15 R GT3 T 3 4 services health
## 4 GP F 15 U GT3 A 3 3 other health
## 5 GP F 15 U GT3 A 4 3 services services
## 6 GP F 15 U GT3 T 1 1 other other
## reason nursery internet guardian traveltime studytime failures
## 1 reputation yes no mother 1 1 0
## 2 course yes yes mother 1 3 0
## 3 course yes yes mother 1 3 0
## 4 reputation yes no father 1 4 0
## 5 reputation yes yes mother 1 2 0
## 6 home no yes father 1 2 0
## schoolsup famsup activities higher romantic famrel freetime goout Dalc
## 1 yes yes yes yes no 4 3 1 1
## 2 yes yes yes yes no 4 3 2 1
## 3 yes yes yes yes no 4 3 2 1
## 4 yes no no yes no 4 3 3 1
## 5 no yes yes yes no 4 3 2 1
## 6 no yes yes yes no 4 3 2 2
## Walc health absences paid.m G1.m G2.m G3.m id.m paid.p G1.p G2.p G3.p
## 1 1 2 8 yes 14 13 13 305 no 14 13 12
## 2 1 5 2 yes 10 9 8 146 no 10 11 10
## 3 1 5 2 yes 12 12 11 237 no 11 12 12
## 4 1 4 10 no 10 11 11 157 no 10 10 10
## 5 1 1 0 yes 14 15 15 306 no 15 14 15
## 6 3 4 2 no 9 10 10 115 no 13 12 12
## id.p
## 1 502
## 2 161
## 3 258
## 4 171
## 5 571
## 6 443
This datframe looks good, but I also want to make a data frame without any categorical variables!
I want to keep all of the possible predictor variables so that I can compare and contrast them with each other, but this next dataframe will just have numbers.
Mastercor <- select(Master,-c(school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,nursery,internet,guardian,schoolsup,famsup,activities,higher,romantic,paid.m,paid.p,id.m,id.p,G1.m,G2.m,G1.p,G2.p))
Time to get an idea of what is going with my dataset!
First I’m going to use some base functions to look at my data
head(Master)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 15 R GT3 T 2 2 at_home other
## 2 GP F 15 R GT3 T 2 4 services health
## 3 GP F 15 R GT3 T 3 4 services health
## 4 GP F 15 U GT3 A 3 3 other health
## 5 GP F 15 U GT3 A 4 3 services services
## 6 GP F 15 U GT3 T 1 1 other other
## reason nursery internet guardian traveltime studytime failures
## 1 reputation yes no mother 1 1 0
## 2 course yes yes mother 1 3 0
## 3 course yes yes mother 1 3 0
## 4 reputation yes no father 1 4 0
## 5 reputation yes yes mother 1 2 0
## 6 home no yes father 1 2 0
## schoolsup famsup activities higher romantic famrel freetime goout Dalc
## 1 yes yes yes yes no 4 3 1 1
## 2 yes yes yes yes no 4 3 2 1
## 3 yes yes yes yes no 4 3 2 1
## 4 yes no no yes no 4 3 3 1
## 5 no yes yes yes no 4 3 2 1
## 6 no yes yes yes no 4 3 2 2
## Walc health absences paid.m G1.m G2.m G3.m id.m paid.p G1.p G2.p G3.p
## 1 1 2 8 yes 14 13 13 305 no 14 13 12
## 2 1 5 2 yes 10 9 8 146 no 10 11 10
## 3 1 5 2 yes 12 12 11 237 no 11 12 12
## 4 1 4 10 no 10 11 11 157 no 10 10 10
## 5 1 1 0 yes 14 15 15 306 no 15 14 15
## 6 3 4 2 no 9 10 10 115 no 13 12 12
## id.p
## 1 502
## 2 161
## 3 258
## 4 171
## 5 571
## 6 443
summary(Master)
## school sex age address
## Length:85 Length:85 Min. :15.00 Length:85
## Class :character Class :character 1st Qu.:15.00 Class :character
## Mode :character Mode :character Median :16.00 Mode :character
## Mean :16.39
## 3rd Qu.:17.00
## Max. :19.00
## famsize Pstatus Medu Fedu
## Length:85 Length:85 Min. :0.0 Min. :1.000
## Class :character Class :character 1st Qu.:2.0 1st Qu.:2.000
## Mode :character Mode :character Median :3.0 Median :3.000
## Mean :2.8 Mean :2.647
## 3rd Qu.:4.0 3rd Qu.:4.000
## Max. :4.0 Max. :4.000
## Mjob Fjob reason
## Length:85 Length:85 Length:85
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## nursery internet guardian traveltime
## Length:85 Length:85 Length:85 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Mode :character Median :1.000
## Mean :1.353
## 3rd Qu.:2.000
## Max. :3.000
## studytime failures schoolsup famsup
## Min. :1.000 Min. :0.00000 Length:85 Length:85
## 1st Qu.:2.000 1st Qu.:0.00000 Class :character Class :character
## Median :2.000 Median :0.00000 Mode :character Mode :character
## Mean :2.106 Mean :0.09412
## 3rd Qu.:3.000 3rd Qu.:0.00000
## Max. :4.000 Max. :3.00000
## activities higher romantic famrel
## Length:85 Length:85 Length:85 Min. :2.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :4.176
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.212 Mean :2.812 Mean :1.282 Mean :1.976
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :3.000 Max. :5.000
## health absences paid.m G1.m
## Min. :1.000 Min. : 0.000 Length:85 Min. : 6.00
## 1st Qu.:3.000 1st Qu.: 0.000 Class :character 1st Qu.: 9.00
## Median :4.000 Median : 0.000 Mode :character Median :12.00
## Mean :3.506 Mean : 1.894 Mean :12.08
## 3rd Qu.:5.000 3rd Qu.: 2.000 3rd Qu.:14.00
## Max. :5.000 Max. :14.000 Max. :19.00
## G2.m G3.m id.m paid.p
## Min. : 5.00 Min. : 0 Min. : 16.0 Length:85
## 1st Qu.:10.00 1st Qu.:10 1st Qu.:131.0 Class :character
## Median :12.00 Median :12 Median :256.0 Mode :character
## Mean :12.24 Mean :12 Mean :236.2
## 3rd Qu.:15.00 3rd Qu.:15 3rd Qu.:329.0
## Max. :19.00 Max. :19 Max. :395.0
## G1.p G2.p G3.p id.p
## Min. : 5.00 Min. : 8.00 Min. : 0.00 Min. : 8.0
## 1st Qu.:11.00 1st Qu.:11.00 1st Qu.:11.00 1st Qu.:278.0
## Median :13.00 Median :12.00 Median :13.00 Median :434.0
## Mean :12.75 Mean :12.82 Mean :13.21 Mean :406.7
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:15.00 3rd Qu.:549.0
## Max. :19.00 Max. :18.00 Max. :19.00 Max. :649.0
plot(Mastercor)
Hard to tell exactly what is going on, I think I will explore further by creating a correlation matrix of some variables I am intrested in so that I can start thinking about how I want to run a statistical model on this data.
Going to do some correlation in hopes of understanding my data
library(dplyr)
#Removing variables that I don't need for my correlation analysis!
Mastercor <- select(Master,-c(school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,nursery,internet,guardian,schoolsup,famsup,activities,higher,romantic,paid.m,paid.p,id.m,id.p,G1.m,G2.m,G1.p,G2.p))
# Running a correlation
Mastercorcor <- cor(Mastercor)
# Need this package to plot!
library(corrplot)
## corrplot 0.84 loaded
# Plotting my findings!
corrplot(Mastercorcor)
By looking at this correlation matrix, I can see some strong negative and positive correlations. Listed below are some that find intresting/and or want to analyze:
traveltime - age - moderate positive
Medu - Fedu - moderate positive
failures - Walc - moderate positive
failures - absences
Dalc - Walc - Strong postive
Failures - Medu and Fedu - strong negative
studytime - Walc and Dalc - moderate negative
goout - Walc - moderate positive
age - G3.m - moderate negative
study time - G3.p - moderate positive
failures - G3.p - moderate negative
health - G3.p - moderate negative
Time to use some plots to determine the best statistical approach
Histograms of some important variables:
hist(Master$G3.m, border="black",
col="orange", main = "Histogram of Math Course Grades", xlab = "Grades for Math Course")
abline(v=mean(Master$G3.m),col="blue")
abline(v=median(Master$G3.m),col="red")
hist(Master$G3.p, border="black",
col="purple", main = "Histogram of Portuguese Language Course Grades", xlab = "Grades for Portuguese Language Course")
abline(v=mean(Master$G3.p),col="blue")
abline(v=median(Master$G3.p),col="red")
hist(Master$studytime, border="black",
col="yellow", main = "Histogram of Studytime", xlab = "Study Time (Hours)")
abline(v=mean(Master$studytime),col="blue")
abline(v=median(Master$studytime),col="red")
hist(Master$failures, border="black",
col="red", main = "Histogram of Failures in Previous classes", xlab = "Failures")
abline(v=mean(Master$failures),col="blue")
abline(v=median(Master$failures),col="green")
hist(Master$age, border="black",
col="green", main = "Histogram of Age", xlab = "Age")
abline(v=mean(Master$age),col="blue")
abline(v=median(Master$age),col="red")
hist(Master$goout, border="black",
col="brown", main = "Histogram of Going out", xlab = "Going out")
abline(v=mean(Master$goout),col="blue")
abline(v=median(Master$goout),col="red")
hist(Master$Fedu, border="black",
col="blue", main = "Histogram of Father's Education", xlab = "Father's Education")
abline(v=mean(Master$Fedu),col="green")
abline(v=median(Master$Fedu),col="red")
hist(Master$Medu, border="black",
col="light blue", main = "Histogram of Mother's Education", xlab = "Mother's Education")
abline(v=mean(Master$Medu),col="green")
abline(v=median(Master$Medu),col="red")
Changing my Master data frame so that it only has final grades and also removing some other unneeded variables
MasterR <- select(Master,-c(id.m,id.p,G1.m,G2.m,G1.p,G2.p,paid.p,paid.m))
I am using Medu, Fedu and studytime in my model, because when I ran my correlation analysis, these three variables seemed to impact grade levels for the portuguese langauge course the most.
library(tidyverse)
#Fitting a basic model to try and predict grades for the Portuguese language course
fit1 <- lm(G3.p ~ Medu + Fedu + studytime, data = MasterR)
summary(fit1)
##
## Call:
## lm(formula = G3.p ~ Medu + Fedu + studytime, data = MasterR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.129 -1.611 0.100 1.332 6.255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6480 1.0425 9.255 2.49e-14 ***
## Medu 0.8653 0.3322 2.604 0.0109 *
## Fedu -0.2892 0.3707 -0.780 0.4376
## studytime 0.9053 0.3482 2.600 0.0111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.655 on 81 degrees of freedom
## Multiple R-squared: 0.1868, Adjusted R-squared: 0.1566
## F-statistic: 6.2 on 3 and 81 DF, p-value: 0.0007604
Interpretation:
First, we can see that the p-value of the F-statistic is 0.0007604, which is significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable of grades in the portuguese language course.
In this first model, with Medu, Fedu, and studytime as predictor variables, the adjusted R2 = 0.1566. This means that, “15% of the variance in the measure the portuguese language course can be predicted mother’s education level, father’s education level and amount of time spent studying.”
Looking at beta coefficients
summary(fit1)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6480445 1.0424674 9.255009 2.489157e-14
## Medu 0.8652896 0.3322393 2.604416 1.094586e-02
## Fedu -0.2892115 0.3707151 -0.780145 4.375793e-01
## studytime 0.9053067 0.3481821 2.600096 1.107365e-02
By looking at the model we can see that for predicting grades in the Portuguese language course Fedu (Father’s Education Level) is not significant for our model. We can also see that study time has has a greater impact on grades in the portuguese language course than Medu (Mother’s education level), but only by a little.
Because the Fedu variable is not significant, I am going to remove it from the model and compare the new model to the old one.
# New model without Fedu
fit2 <- lm(G3.p ~ Medu + studytime, data = MasterR)
summary(fit2)
##
## Call:
## lm(formula = G3.p ~ Medu + studytime, data = MasterR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9473 -1.5623 0.1271 1.2078 6.1271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3323 0.9584 9.737 2.47e-15 ***
## Medu 0.6894 0.2434 2.832 0.00582 **
## studytime 0.9256 0.3464 2.672 0.00909 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.649 on 82 degrees of freedom
## Multiple R-squared: 0.1806, Adjusted R-squared: 0.1607
## F-statistic: 9.039 on 2 and 82 DF, p-value: 0.0002834
In my model, with Medu and studytime predictor variables, the adjusted R2 = 0.1607. This means that “16% of the variance in the measure of grades in the portuguese language course can be predicted mother’s education level and amount of time spent studying.”
After analyzing my two models, I now want to compare my two models to see which is better at predicting grades for the portuguese language course
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: G3.p ~ Medu + Fedu + studytime
## Model 2: G3.p ~ Medu + studytime
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 571.05
## 2 82 575.34 -1 -4.2908 0.6086 0.4376
If the p-value is not sufficiently low (usually greater than 0.05), we should favor the simpler model. We can see from the table, that we have a Df of -1 (indicating that the less complex model has one less parameter). It also shows a p-value of (0.4376). Meaning that removing the Fedu Variable from the model did lead to a significantly improved fit over the first model.(Favor simpler model)
library(jtools)
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
plot_summs(fit1, scale = TRUE, fit2, plot.distributions = TRUE)
par(mfrow=c(2,2))
plot(fit2)
I don’t see any distinct patterns, so I will move on to plotting my model
library(ggplot2)
library(jtools)
library(ggstance)
effect_plot(fit2, pred = studytime, interval = TRUE, plot.points = TRUE)
effect_plot(fit2, pred = Medu, interval = TRUE, plot.points = TRUE)
plot_summs(fit2, scale = TRUE)
plot_summs(fit2, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)
Making the two predictors and the response into specific objects:
var_1 <- Master$studytime
var_2 <- Master$Medu
var_3 <- Master$G3.p
Plot time!
library(plotly)
##
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- plot_ly(Master,
x = var_1,
y = var_2,
z = var_3) %>%
add_markers(color = ~G3.p) %>%
layout(scene = list(xaxis = list(title = 'Study Time'),
yaxis = list(title = 'Mothers Education'),
zaxis = list(title = 'Final Grades for Portuguese Language Course ')))
p
I will use failures, going out, and weekday alcohl in my model, because when I ran my correlation matrix and plotted my variables, these three seemed to have the most impact on weekend alcohol consumption.
fit3 <- lm(Walc ~ failures + goout + Dalc, data = MasterR)
summary(fit3)
##
## Call:
## lm(formula = Walc ~ failures + goout + Dalc, data = MasterR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0133 -0.6250 -0.2592 0.4526 2.7986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4501 0.3356 -1.341 0.18366
## failures 0.6790 0.2441 2.782 0.00673 **
## goout 0.2882 0.0888 3.245 0.00171 **
## Dalc 1.2106 0.1826 6.630 3.46e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8656 on 81 degrees of freedom
## Multiple R-squared: 0.4941, Adjusted R-squared: 0.4753
## F-statistic: 26.37 on 3 and 81 DF, p-value: 5.351e-12
Interpretation:
We can see that the p-value of the F-statistic is 5.351e-12, which is highly significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable of weekend alcohol consumption.
Also: In this new model, with failures, goingout, and Dalc as predictor variables, the adjusted R2 = 0.4753 . This means that “47% of the variance in the measure of Weekend Alcohol consumption can be predicted by amount of failures in classes, number of times the student go’s out during the week, and by weekday alcohol consumption”
Looking at beta coefficients
summary(fit3)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4500896 0.33563175 -1.341022 1.836617e-01
## failures 0.6790207 0.24411888 2.781516 6.728045e-03
## goout 0.2881807 0.08879627 3.245414 1.705979e-03
## Dalc 1.2105529 0.18258589 6.630046 3.457956e-09
By looking at the model we can see that for predicting weekend alcohol consumption, all variables are significant for our model. We can also see that Dalc (Weekday alcohol consumption) has the a greatest impact on weekend alcohol consumption by a large margin. going out seems to have the least impact on weekend alcohol consumption, so it might be good to run another model without it.
(I Could also add variables to this model, but not many other predictor variables were significantlly correlated)
Because the goingout variable isn’t as significant in this model as the other variables I am going to run a model without it!
fit4 <- lm(Walc ~ failures + Dalc, data = MasterR)
summary(fit4)
##
## Call:
## lm(formula = Walc ~ failures + Dalc, data = MasterR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8803 -0.5435 -0.5435 0.4565 3.4565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2799 0.2632 1.063 0.29073
## failures 0.8096 0.2544 3.182 0.00206 **
## Dalc 1.2636 0.1921 6.577 4.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9145 on 82 degrees of freedom
## Multiple R-squared: 0.4283, Adjusted R-squared: 0.4143
## F-statistic: 30.71 on 2 and 82 DF, p-value: 1.109e-10
In this model, with failures and Dalc as predictor variables, the adjusted R2 = 0.4143. This means that “41% of the variance in the measure of Weekend Alcohol consumption can be predicted by amount of failures in classes and by weekday alcohol consumption”
This is a smaller R-squared value compared to my first model, so I will compare them with anova!
anova(fit3, fit4)
## Analysis of Variance Table
##
## Model 1: Walc ~ failures + goout + Dalc
## Model 2: Walc ~ failures + Dalc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 60.689
## 2 82 68.581 -1 -7.8917 10.533 0.001706 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The resulting p-value is sufficiently low (less than 0.05), we conclude that the more complex model is significantly better than the simpler model, and thus favor the more complex model.
We can see from the table, that we have a Df of -1 (indicating that the less complex model has one less parameter). It also shows a p-value of (0.001706). Meaning that removing the goingout Variable from the model did not lead to a significantly improved fit over the first model and we should favor the more complex model.
library(jtools)
library(ggstance)
plot_summs(fit3, scale = TRUE, fit4, plot.distributions = TRUE)
par(mfrow=c(2,2))
plot(fit3)
I don’t see any distinct patterns, so I will move on to plotting my model
library(ggplot2)
library(jtools)
library(ggstance)
effect_plot(fit3, pred = Dalc, interval = TRUE, plot.points = TRUE)
effect_plot(fit3, pred = goout, interval = TRUE, plot.points = TRUE)
effect_plot(fit3, pred = failures, interval = TRUE, plot.points = TRUE)
plot_summs(fit3, scale = TRUE)
plot_summs(fit3, scale = TRUE, plot.distributions = TRUE, inner_ci_level = .9)
Making the two predictors and the response into specific objects:
var_4 <- Master$Dalc
var_6 <- Master$failures
var_7 <- Master$Walc
Plot time!
library(plotly)
a <- plot_ly(Master,
x = var_4,
y = var_6,
z = var_7) %>%
add_markers(color = ~Walc) %>%
layout(scene = list(xaxis = list(title = 'Weekday Alcohol Consumption'),
yaxis = list(title = 'Class Failures'),
zaxis = list(title = 'Weekend Alcohol Consumption')))
a
Conclusion for Multiple Regression:
For Grades in Portuguese language course
For a 1 Unit increase in Mother’s education level, with all other predictors held constant, we would expect to see a 0.6894 increase in Grades for the Portuguese language course
And for a 1 Unit increase in study time, with all other predictors held constant, we would expect to see a 0.9256 increase in Grades for the Portuguese language course
For Weekend Alcohol Consumption:
For a 1 Unit increase failures, with all other predictors held constant, we would expect to see a 0.6790207 increase in weekend alcohol consimption
And for a 1 Unit increase in going out, with all other predictors held constant, we would expect to see a 0.2881807 increase in weekend alcohol consimption
And for a 1 Unit increase in Daily Alcohol Consumption, with all other predictors held constant, we would expect to see a 1.2105529 increase in weekend alcohol consimption
This assignment was fun, I learned a lot in terms of how to create a story and move through a statistical analysis.
In the future, it would be cool to run more regression models, because there are many variables to predict from and I didn’t look at grades for the math course.
(Also logestic regression with the categorical variables would be an intresting analysis)